Traffic Fatality Data

I have decided to explore the 2015 U.S. Department of Transportation Fatality Analysis Reporting System (FARS) report. I think this is an important data set to understand because 35,092 people died on public roadways in 2015 (nearly 100 per day!), an increase of 7.2% over 2014. DJ Patil and Mark Rosekind of the Obama administration issued a public call to action asking citizens to help explore the data.

I was originally looking into all of the data, but this report was quickly becoming unmanageably long, so I have refocused on just the person data file. My hope is that by shedding light on who is involved in these collisions, the data can become more personal than just “35,092 people died,” which is hard to conceptualize.

Understanding the data

From the FARS person data file, we have information on 80,587 people in 32,166 different collisions that had fatalities during 2015. We have 68 features about the people involved in these accidents. The in-depth details of each field are available in the FARS Analytical User Guide on pages 240-284. I will briefly touch on each field as we explore it, though, so don’t dive into the documentation yet!

To get a general understanding, here are all the fields:

##  [1] "STATE"      "ST_CASE"    "VE_FORMS"   "VEH_NO"     "PER_NO"    
##  [6] "STR_VEH"    "COUNTY"     "DAY"        "MONTH"      "HOUR"      
## [11] "MINUTE"     "RUR_URB"    "FUNC_SYS"   "HARM_EV"    "MAN_COLL"  
## [16] "SCH_BUS"    "MAKE"       "MAK_MOD"    "BODY_TYP"   "MOD_YEAR"  
## [21] "TOW_VEH"    "SPEC_USE"   "EMER_USE"   "ROLLOVER"   "IMPACT1"   
## [26] "FIRE_EXP"   "AGE"        "SEX"        "PER_TYP"    "INJ_SEV"   
## [31] "SEAT_POS"   "REST_USE"   "REST_MIS"   "AIR_BAG"    "EJECTION"  
## [36] "EJ_PATH"    "EXTRICAT"   "DRINKING"   "ALC_DET"    "ALC_STATUS"
## [41] "ATST_TYP"   "ALC_RES"    "DRUGS"      "DRUG_DET"   "DSTATUS"   
## [46] "DRUGTST1"   "DRUGTST2"   "DRUGTST3"   "DRUGRES1"   "DRUGRES2"  
## [51] "DRUGRES3"   "HOSPITAL"   "DOA"        "DEATH_DA"   "DEATH_MO"  
## [56] "DEATH_YR"   "DEATH_HR"   "DEATH_MN"   "DEATH_TM"   "LAG_HRS"   
## [61] "LAG_MINS"   "P_SF1"      "P_SF2"      "P_SF3"      "WORK_INJ"  
## [66] "HISPANIC"   "RACE"       "LOCATION"

As you may be able to tell, these can be broken down into a few categories: info about the crash in general (STATE, DAY, HOUR, etc.), info about the vehicle the person was in (MAKE, BODY_TYP, etc.), info about the individual’s situation in the collision (SEAT_POS, INJ_SEV, EJECTION, DRUGS, etc.), and demographics (AGE, SEX, RACE, etc.).

To get started, I have created four new features. SURVIVED is a clear label of whether the person survived or died in the collision. Our data set has 35,092 fatalities, which matches the 35,092 released by the White House, so we can validate our data somewhat. I also added COLLISION_DATE and COLLISION_DATETIME fields in more usable date formats in case they are useful later. Finally, I created the STATE_NAME column by mapping the state IDs in the dataset to the corresponding state names from the documentation.
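As a rough illustration of how these four columns could be derived, here is a minimal base R sketch on made-up rows. It assumes that INJ_SEV code 4 marks a fatal injury, that 99 marks an unknown HOUR or MINUTE, and that state codes 1, 6, and 48 map to Alabama, California, and Texas; the `state_lookup` vector is a hypothetical stand-in for the full mapping in the documentation.

```r
# Toy stand-in for the person file; every value below is made up.
person <- data.frame(
  STATE   = c(1L, 6L, 48L),
  MONTH   = c(5L, 8L, 10L),
  DAY     = c(3L, 15L, 11L),
  HOUR    = c(18L, 2L, 99L),   # assumption: 99 means the hour is unknown
  MINUTE  = c(30L, 5L, 99L),   # assumption: 99 means the minute is unknown
  INJ_SEV = c(4L, 0L, 3L)      # assumption: 4 means a fatal injury
)

# Hypothetical lookup; a full version would list every state code.
state_lookup <- c("1" = "Alabama", "6" = "California", "48" = "Texas")

# TRUE when the person did not die in the collision.
person$SURVIVED <- person$INJ_SEV != 4L

# Every record in this file is from 2015, so the year is hard-coded.
date_text <- sprintf("2015-%02d-%02d", person$MONTH, person$DAY)
person$COLLISION_DATE <- as.Date(date_text)

# Only build a timestamp when both the hour and the minute are known.
time_known <- person$HOUR <= 23L & person$MINUTE <= 59L
datetime_text <- ifelse(
  time_known,
  sprintf("%s %02d:%02d", date_text, person$HOUR, person$MINUTE),
  NA_character_
)
person$COLLISION_DATETIME <- as.POSIXct(
  datetime_text, format = "%Y-%m-%d %H:%M", tz = "UTC"
)

# Map the numeric state code to a readable name.
person$STATE_NAME <- unname(state_lookup[as.character(person$STATE)])

# Count of fatalities, the figure compared against the published total.
fatalities <- sum(!person$SURVIVED)
```

On the real file, the same `sum(!person$SURVIVED)` check is what would be compared against the published 35,092.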

There is obviously a ton to explore, so let’s dive in!

Who’s involved in these collisions?

It appears that it is most common for there to be two people involved in a collision, whether it be two people in one vehicle, one in each of two vehicles, or one in a vehicle and a pedestrian. As you might expect, the more people involved in a collision, the rarer it is, although there seems to have been at least one collision with almost 100 people involved.
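The people-per-collision distribution can be sketched by counting person rows per ST_CASE and then tabulating those counts. The case numbers below are made up, and this is only one plausible way to compute it.

```r
# Toy person rows: one row per person, keyed by the collision id ST_CASE.
person <- data.frame(
  ST_CASE = c(10001, 10001, 10002, 10003, 10003, 10003, 10004, 10004)
)

# Number of people recorded in each collision.
people_per_case <- as.vector(table(person$ST_CASE))

# How many collisions involved 1, 2, 3, ... people.
size_distribution <- table(people_per_case)
print(size_distribution)

# The most common number of people in a collision.
most_common_size <- as.integer(names(which.max(size_distribution)))
```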

It seems that single vehicle crashes are most common.

I found it odd that there were any collisions with 0 vehicles, but after looking at just the cases that had no vehicles (I commented out the View statement, but feel free to look yourself), it looks like they are all missing all vehicle information and PER_TYP labels them as people who were not motor vehicle occupants. My working suspicion is that these are pedestrians who were struck in a hit and run, so the individual was found but the vehicle that struck them was not.

Finally, it is most common for there to be one person in each vehicle.

I’m starting to see why my insurance was so high in my teens and early twenties: there are many more people that age involved in collisions than older adults. I have also colored the histogram with the inverse of SURVIVED so that we can easily see the distribution of people who passed away in these accidents. However, this includes passengers, so let’s focus on drivers.

This actually flattens the distribution out a good bit, which leads me to believe that younger people tend to ride along with others more often than older people do.

Speaking of insurance…

I can also see why being male leads to higher premiums! It is shocking how much more often males are drivers in fatal accidents. It looks to be at least twice as common for most ages.

When do accidents occur?

## # A tibble: 10 × 3
##    COLLISION_DATE crash_per_day weekday
##            <date>         <int>   <ord>
## 1      2015-05-03           142     Sun
## 2      2015-08-15           139     Sat
## 3      2015-10-11           139     Sun
## 4      2015-08-02           138     Sun
## 5      2015-09-05           138     Sat
## 6      2015-06-27           136     Sat
## 7      2015-07-25           136     Sat
## 8      2015-06-07           135     Sun
## 9      2015-07-26           132     Sun
## 10     2015-08-16           131     Sun

Besides all being weekend days, there doesn’t seem to be anything special about the ten days with the most collisions.
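A table like the one above can be produced by keeping one row per collision, counting collisions per date, and sorting. This is a base R sketch on made-up rows rather than the exact code behind the output; the weekday labels come from a fixed vector so they do not depend on the system locale.

```r
# Toy person rows; several people can share one collision (ST_CASE).
person <- data.frame(
  ST_CASE = c(1, 1, 2, 3, 4, 4, 5),
  COLLISION_DATE = as.Date(c(
    "2015-05-03", "2015-05-03", "2015-05-03", "2015-05-04",
    "2015-08-15", "2015-08-15", "2015-08-15"
  ))
)

# One row per collision so that crashes, not people, are counted.
crashes <- unique(person[, c("ST_CASE", "COLLISION_DATE")])

# Count collisions per calendar date.
per_day <- as.data.frame(
  table(COLLISION_DATE = crashes$COLLISION_DATE),
  responseName = "crash_per_day",
  stringsAsFactors = FALSE
)
per_day$COLLISION_DATE <- as.Date(per_day$COLLISION_DATE)

# Locale-independent weekday label (POSIXlt wday is 0 for Sunday).
day_names <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
per_day$weekday <- day_names[as.POSIXlt(per_day$COLLISION_DATE)$wday + 1]

# Ten dates with the most collisions, highest first.
top_days <- head(per_day[order(-per_day$crash_per_day), ], 10)
print(top_days)
```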

Along the same lines, Saturdays and Sundays seem to have the most collisions, followed by Fridays.

I was curious to see whether there are more collisions in November and December than in other months, since there are so many people traveling plus celebration drinks. This isn’t so straightforward, though, since different months have different numbers of days.

Interestingly, while accounting for month length does flatten out the plot, the holiday months don’t seem to be especially high as I had expected. It almost looks more like people might not be out driving as much in the colder months of the year.
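To make the month-length adjustment concrete, here is a small sketch that divides monthly totals by the number of days in each month of 2015. The monthly counts are invented purely for illustration and are not the FARS figures.

```r
# Days in each month of 2015, derived from the calendar itself.
month_starts <- seq(as.Date("2015-01-01"), by = "month", length.out = 13)
days_in_month <- setNames(as.integer(diff(month_starts)), month.abb)

# Made-up monthly collision totals, NOT the real FARS counts.
crashes_per_month <- setNames(
  c(2480, 2240, 2480, 2550, 2790, 2700, 2945, 2945, 2700, 2790, 2550, 2635),
  month.abb
)

# Average collisions per day, which is comparable across months.
crashes_per_day <- crashes_per_month / days_in_month
print(round(crashes_per_day, 1))
```

In this toy data February has fewer collisions than January in total, but the same number per day, which is exactly the kind of difference the adjustment removes.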

Crashes are at their lowest in the early morning hours, then rise while people are commuting to work before dropping back down. From there they rise steadily throughout the day, with the most dangerous time being 6pm, while people are driving home from work.

Looking at crashes by hour for each day of the week, it is very obvious that there are a lot of crashes late on Fridays and Saturdays and early on Saturdays and Sundays.
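The hour-by-weekday view boils down to a two-way count. Here is a minimal sketch, assuming a COLLISION_DATETIME column like the one created earlier and using made-up timestamps.

```r
# Toy rows: one timestamp per person, keyed by the collision id ST_CASE.
person <- data.frame(
  ST_CASE = c(1, 1, 2, 3, 4),
  COLLISION_DATETIME = as.POSIXct(
    c("2015-05-02 23:15", "2015-05-02 23:15", "2015-05-03 01:40",
      "2015-05-03 02:10", "2015-05-06 17:30"),
    format = "%Y-%m-%d %H:%M", tz = "UTC"
  )
)

# Keep one row per collision so each crash is counted once.
crashes <- person[!duplicated(person$ST_CASE), ]

# Split the timestamp into weekday and hour (wday is 0 for Sunday).
parts <- as.POSIXlt(crashes$COLLISION_DATETIME, tz = "UTC")
day_names <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# 7 x 24 table of collision counts; empty cells stay at zero.
by_hour_day <- table(
  weekday = factor(day_names[parts$wday + 1], levels = day_names),
  hour    = factor(parts$hour, levels = 0:23)
)
print(by_hour_day["Sun", ])
```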

WORK_INJ is only flagged when the person’s death certificate says that they were “at work” at the time of the crash. As you may expect, incidents where people were working seem to be most common during the day on weekdays.

By splitting the seat position into two features, we can see how often people are in each seat and how many people survive in each seat.
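One way to split the seat position is sketched below. It assumes that SEAT_POS is a two-digit code whose tens digit is the row (1 = front, 2 = second, and so on) and whose ones digit is the position within the row (1 = left, 2 = middle, 3 = right), with every other code treated as missing. SEAT_ROW and SEAT_SIDE are hypothetical column names, and the rows are made up.

```r
# Toy data; 0 and 99 stand in for codes that are not a regular seat.
seats <- data.frame(
  SEAT_POS = c(11, 13, 21, 23, 32, 0, 99),
  SURVIVED = c(FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)
)

seat_row  <- seats$SEAT_POS %/% 10  # tens digit: which row
seat_side <- seats$SEAT_POS %% 10   # ones digit: position within the row

# Assumption: only rows 1-4 and positions 1-3 describe a regular seat.
valid <- seat_row %in% 1:4 & seat_side %in% 1:3

seats$SEAT_ROW  <- ifelse(valid, seat_row, NA)
seats$SEAT_SIDE <- c("left", "middle", "right")[ifelse(valid, seat_side, NA)]

# Share of people who survived in each seat; empty cells come out as NA.
survival_by_seat <- tapply(
  seats$SURVIVED,
  list(row = seats$SEAT_ROW, side = seats$SEAT_SIDE),
  mean
)
print(survival_by_seat)
```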

Drug and Alcohol Use

I found it odd that so few passengers were recorded as either drunk or sober, but it has occurred to me that perhaps police officers at the scene of a fatal accident have better things to do than check whether passengers have been drinking.

This seems somewhat confirmed by only 20.82% of passengers having a reported value of either alcohol involved or alcohol not involved.
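A percentage like this can be computed as the share of passenger rows whose DRINKING value is a definite yes or no. The sketch below assumes that PER_TYP code 2 marks a passenger and that DRINKING uses 0 for no and 1 for yes, with other codes meaning not reported or unknown; the rows are made up.

```r
# Toy rows; codes 8 and 9 stand in for "not reported" and "unknown".
person <- data.frame(
  PER_TYP  = c(1, 2, 2, 2, 2, 2, 5),
  DRINKING = c(1, 0, 8, 9, 1, 8, 0)
)

# Assumption: PER_TYP == 2 identifies passengers.
passengers <- person[person$PER_TYP == 2, ]

# A status counts as reported only if it is a definite no (0) or yes (1).
status_known <- passengers$DRINKING %in% c(0, 1)

# Percentage of passengers with a reported drinking status.
pct_known <- round(100 * mean(status_known), 2)
print(pct_known)
```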

I suppose it makes sense for there to be fewer drunk driving incidents in the middle of the day than in the evenings, but I am a bit surprised by how strongly accidents where driver drinking was a factor ran inversely to those where it wasn’t.

Looking at it per day, you get basically what you would expect: a slight increase in drunk driver crashes around dinner and happy hour time throughout the week, with the big spikes in drunk driving on weekend late nights and early mornings.

Another thing I hadn’t caught in the earlier plots of time per weekday is that you can basically see people sleeping in on weekend mornings compared to weekdays.

This is a really interesting one to unpack. I am pretty surprised to see how often collisions involving both drugs and drinking outnumber those involving only drugs. It’s also interesting to me that while all substance-related collisions are a lot lower during the day, almost no one is doing drugs and drinking mid-weekday (although there are still accidents with one or the other).

Child Restraints

It’s a relief to see that child seats do seem to keep children somewhat safe compared to the 56.45% survival rate for everyone involved in crashes that have a fatality. However, with the number of collisions that have only one person in each vehicle, this comparison can’t really be made so directly.
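The 56.45% baseline follows directly from the two totals quoted earlier (80,587 people involved and 35,092 fatalities), as this short check shows.

```r
# Totals quoted earlier in the report.
people_involved <- 80587
fatalities      <- 35092

# Everyone who was involved but did not die counts as a survivor.
survivors <- people_involved - fatalities

# Overall survival rate across all people in fatal collisions.
survival_rate_pct <- round(100 * survivors / people_involved, 2)
print(survival_rate_pct)

# The "nearly 100 per day" figure, using the 365 days of 2015.
deaths_per_day <- round(fatalities / 365, 1)
print(deaths_per_day)
```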

On the bright side, it appears that most people are avoiding putting child seats in the front rows of vehicles!

I’m shocked by how many kids under 10 are in cars with just a seat belt or, more terrifyingly, with no restraint at all.

While most of my Googling has suggested that children should be in a booster seat until they are 8-10 years old, it looks like it’s actually rare for people to use child restraints that long.

With a higher percentage of children surviving at every age if they were in a child restraint, it seems like we can’t stress enough how important they are!

In an attempt not to sensationalize the 100% of 10-year-olds in child restraints surviving, it should be noted that there were only 4 of them, while there were 315 not in restraints.
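One way to guard against reading too much into tiny groups is to report the group size next to each survival rate and blank out rates that rest on too few children. This sketch uses made-up rows, a hypothetical IN_CHILD_RESTRAINT flag, and an arbitrary threshold `min_n`.

```r
# Toy rows; IN_CHILD_RESTRAINT is a hypothetical derived flag.
kids <- data.frame(
  AGE                = c(2, 2, 2, 2, 10, 10, 10),
  IN_CHILD_RESTRAINT = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE),
  SURVIVED           = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
)

groups <- list(age = kids$AGE, restrained = kids$IN_CHILD_RESTRAINT)

# Survival rate and group size for every age / restraint combination.
survival_rate <- tapply(kids$SURVIVED, groups, mean)
group_size    <- tapply(kids$SURVIVED, groups, length)

# Arbitrary cutoff: hide rates that are based on fewer than min_n children.
min_n <- 2
reliable_rate <- survival_rate
reliable_rate[group_size < min_n] <- NA

print(group_size)
print(reliable_rate)
```

Here the lone restrained 10-year-old has a 100% survival rate, but it is masked because it rests on a single child.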

Lag time

I originally thought that lag time was how long it took for an individual to get to the hospital. Unfortunately, this field is a bit darker: it is actually how long the person survived after the crash before they passed away.
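The lag is stored in two fields, LAG_HRS and LAG_MINS, so a single duration has to be assembled from them. The sketch below assumes that 999 in LAG_HRS and 99 in LAG_MINS mean the value is unknown; LAG_TOTAL_HRS is a hypothetical column name and the rows are made up.

```r
# Toy rows; the third row uses the assumed "unknown" codes.
person <- data.frame(
  LAG_HRS  = c(0, 2, 999, 36),
  LAG_MINS = c(15, 30, 99, 0)
)

# Assumption: 999 hours or 99 minutes means the lag was not recorded.
lag_unknown <- person$LAG_HRS == 999 | person$LAG_MINS == 99

# Time from crash to death in hours, left missing when unknown.
person$LAG_TOTAL_HRS <- ifelse(
  lag_unknown,
  NA,
  person$LAG_HRS + person$LAG_MINS / 60
)
print(person)
```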

I thought it was interesting that the older people are, the longer they seem to survive before passing away. However, there are tons of confounding factors here. For instance, someone who is younger may be in a similar physical situation to someone older, but the younger person may recover, which would mean they do not have a lag time. So I don’t think we can really draw any conclusions from this.

Final Plots and Summary

Plot 1

Description 1:

Accidents with alcohol involvement are much more frequent on weekend evenings. They also increase somewhat during the evenings throughout the week compared to the middle of the day, when there are very few fatal accidents with alcohol involvement. Accidents with drug or drug and alcohol involvement are also highest on weekend evenings, although not nearly to the degree that alcohol-only accidents are.

Plot 2

Description 2:

Child restraints are most commonly put in the second row of vehicles, which makes sense because it is recommended not to put them in the front row (especially with airbags) and most cars have two rows. The second most common location is the third row (vans and SUVs). We can also see that the type of child restraint is most frequently not reported or unknown to the officer, but forward-facing seats are the most common known child restraints.

Plot 3

Description 3:

You can view this plot as a vehicle’s standard layout, with the driver in the top left corner having by far the highest fatality rate, followed by the front right (“shotgun”) seat. Another interesting takeaway: I have always heard that the driver-side back seat is the safest in the car, but it appears to actually have the lowest survival rate of any second or third row seat.

*It should be noted that this plot is biased, since the data only covers collisions in which there was a fatality and it is most common for there to be only one occupant, the driver, so you would expect the driver’s survival rate to be lower.

Reflection

The FARS report has an incredible amount of information in it; even focusing on just the person data CSV, there were 80,587 individuals with 75 features. To wrap my head around this data set, I started by exploring a few of the more prominent features individually, but with so much interesting information available it is hard to resist diving into comparisons for very long. I ended up exploring a lot of relationships, but the two main questions most of them came back to are what the survival rates are in different situations and under what circumstances accidents with fatalities happen.

There were some relationships that seemed pretty clear, such as child restraints saving children’s lives and drugs and alcohol being a factor more frequently on weekend evenings. However, with so many confounding factors to consider, a lot of the relationships were less clear. One thing that was tricky with this dataset is that most of the features are categorical, which seems to make regressions and comparisons harder to do.

There is still almost endless additional exploration of this data to be done (which is probably why the White House put out the call to action). I think there are a lot of insights to be gained by combining more of the data files. For example, bringing in weather conditions from the accident file and vehicle make/model from the vehicle file and combining them with our data about child restraints could give insights into what cars parents should avoid, especially if they live somewhere that is particularly rainy or snowy.

Plot Count

Hi reviewer, to save you from having to count my plots here is a count of everything in Figs/: 26

I deleted the directory right before knitting (so no duplicate counts from changing r chunk names).